Class-based Prediction Errors to Categorize Text with Out-of-vocabulary Words
Abstract
Common approaches to text categorization essentially rely either on n-gram counts or on word embeddings. This presents important difficulties in highly dynamic or quickly-interacting environments, where the appearance of new words and/or varied misspellings is the norm. To better deal with these issues, we propose to use the error signal of class-based language models as input to text classification algorithms. In particular, we train a next-character prediction model for any given class, and then exploit the error of such class-based models to inform a neural network classifier. This way, we shift from the ‘ability to describe’ seen documents to the ‘ability to predict’ unseen content. Preliminary studies using out-of-vocabulary splits from abusive tweet data show promising results, outperforming competitive text categorization strategies by 4–11%.
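The idea of turning class-based prediction errors into classifier inputs can be illustrated with a minimal sketch. It is not the authors' implementation: an add-one smoothed character bigram model stands in for the trained next-character prediction model, and a lowest-error rule stands in for the neural network classifier. The names `CharBigramModel`, `train_class_models`, `error_features`, `predict`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical symbols: a start-of-text context and a bucket for unseen characters.
START, UNKNOWN = "\x02", "\x00"


class CharBigramModel:
    """Add-one smoothed next-character model (stand-in for a neural model)."""

    def __init__(self, alphabet):
        self.alphabet = set(alphabet)
        self.vocab_size = len(self.alphabet) + 1  # +1 for the UNKNOWN bucket
        self.counts = defaultdict(Counter)

    def _symbols(self, text):
        # Characters outside the shared alphabet collapse into UNKNOWN.
        return [START] + [c if c in self.alphabet else UNKNOWN for c in text]

    def fit(self, texts):
        for text in texts:
            symbols = self._symbols(text)
            for prev, nxt in zip(symbols, symbols[1:]):
                self.counts[prev][nxt] += 1
        return self

    def error(self, text):
        """Mean negative log-probability per character (in nats)."""
        symbols = self._symbols(text)
        if len(symbols) < 2:
            return 0.0
        total = 0.0
        for prev, nxt in zip(symbols, symbols[1:]):
            context = self.counts.get(prev, Counter())
            prob = (context[nxt] + 1) / (sum(context.values()) + self.vocab_size)
            total -= math.log(prob)
        return total / (len(symbols) - 1)


def train_class_models(docs_by_class):
    """Train one model per class over an alphabet shared by all classes."""
    alphabet = {c for docs in docs_by_class.values() for d in docs for c in d}
    return {label: CharBigramModel(alphabet).fit(docs)
            for label, docs in docs_by_class.items()}


def error_features(models, text):
    """One prediction error per class, in sorted label order."""
    return [models[label].error(text) for label in sorted(models)]


def predict(models, text):
    """Simplest possible classifier: pick the class whose model errs least."""
    return min(sorted(models), key=lambda label: models[label].error(text))


# Toy data, invented for illustration only.
toy_docs = {
    "abusive": ["you are an idiot", "stupid idiot", "what an idiot you are"],
    "neutral": ["have a nice day", "nice weather today", "what a lovely day"],
}
models = train_class_models(toy_docs)
features = error_features(models, "idiooot")  # a misspelling unseen in training
```

In this sketch the vector returned by `error_features` plays the role of the error signal: a downstream classifier would be trained on such vectors rather than on n-gram counts or word embeddings, so a misspelled or previously unseen word still produces a usable input.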
Similar resources
Class-based Prediction Errors to Detect Hate Speech with Out-of-vocabulary Words
Common approaches to text categorization essentially rely either on n-gram counts or on word embeddings. This presents important difficulties in highly dynamic or quickly-interacting environments, where the appearance of new words and/or varied misspellings is the norm. A paradigmatic example of this situation is abusive online behavior, with social networks and media platforms struggling to ef...
Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB), one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, IRIB's archive is one of the richest archives in Iran, containing a huge amount of multimedia data. Monitoring this massive volume of data, and browsing and retrieval of this archive, is one of the key issues for this broadcasting...
Class-Based N-Gram Language Model for New Words Using Out-of-Vocabulary to In-Vocabulary Similarity
Out-of-vocabulary (OOV) words create serious problems for automatic speech recognition (ASR) systems. Not only are they misrecognized as in-vocabulary (IV) words with similar phonetics, but the error also causes further errors in nearby words. Language models (LMs) for most open vocabulary ASR systems treat OOV words as a single entity, ignoring the linguistic information. In this paper we pre...
L2 Vocabulary Learning and the Use of Reading Tasks: Manipulating the Involvement Load Index
As Schmidt (2008) states, deeper engagement with new vocabulary as induced by tasks clearly increases the chances of learning those words. This engagement is theoretically clarified by the involvement load hypothesis (ILH, Laufer and Hulstijn, 2001), based on which the involvement index of each task can be measured. The present study was designed to test ILH by evaluating the impact of 4 differ...